This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.
You can find the complete handbook on Github
When manipulating data, pivoting can be understood to refer to one of two processes: 1. the creation of pivot tables, which are tables “… of statistics that summarize the data of a more extensive table (such as from a database, spreadsheet, or business intelligence program). This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way… They arrange and rearrange (or”pivot“) statistics in order to draw attention to useful information. This leads to finding figures and facts quickly making them integral to data analysis.” https://en.wikipedia.org/wiki/Pivot_table#.
The former is a crucial step in data analysis, and is covered elsewhere (link to dplyr, summarise, etc). In this section, we will focus on the latter definition.
https://datacarpentry.org/r-socialsci/03-dplyr-tidyr/index.html
Transforming a dataset from wide to long
Data are often entered and stored in a format that might be useful for presentation, but not for analysis. Let us take the count_data dataset as an example, which is stored in a “wide” format, which means that each column is a variable and each row an observation. This is useful for presenting the information in a table or for entering data (e.g. in Excel) from case report forms. However, these typically needs to be transformed to “long” format in order to analyse and visualise.
DT::datatable(count_data, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )Each observation in this dataset refers to the malaria counts at one of 65 facilities on a given date, ranging from 2019-03-18 to 2019-06-14. These facilties are located in one Province (North) and four Districts (Spring, Bolo, Dingo, and Barnard). The dataset provides the overall counts of malaria, as well as age-specific counts in each of three age groups - <4 years, 5-14 years, and 15 years and older.
Visualising the overall malaria counts over time poses no difficulty with the data in it’s current format:
ggplot(count_data) +
geom_col(aes(x = data_date, y = malaria_tot))However, what if we wanted to display the relative contributions of each age group to this total count? In this case, we need to ensure that the variable of interest (age group), appears in the dataset in a single column that can be passed to {ggplot2}’s “aesthetics” (aes()) function.
Consider also using the common problem whereby data are stored with dates as the columns, as in tidyr::table4a
tidyr::table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766pivot_longer()First, let’s begin by loading our packages and converting count_data to a tibble for easy printing:
pacman::p_load(tidyverse)
# Convert count_data to `tibble` for better printing
count_data <-
count_data %>%
as_tibble()
count_data
## # A tibble: 3,038 x 10
## location_name data_date submitted_date Province District `malaria_rdt_0-…
## <chr> <date> <date> <chr> <chr> <int>
## 1 Facility 1 2019-06-13 2019-06-14 North Spring 11
## 2 Facility 2 2019-06-13 2019-06-14 North Bolo 11
## 3 Facility 3 2019-06-13 2019-06-14 North Dingo 8
## 4 Facility 4 2019-06-13 2019-06-14 North Bolo 16
## 5 Facility 5 2019-06-13 2019-06-14 North Bolo 9
## 6 Facility 6 2019-06-13 2019-06-14 North Dingo 3
## 7 Facility 6 2019-06-12 2019-06-14 North Dingo 4
## 8 Facility 5 2019-06-12 2019-06-14 North Bolo 15
## 9 Facility 5 2019-06-11 2019-06-14 North Bolo 11
## 10 Facility 5 2019-06-10 2019-06-14 North Bolo 19
## # … with 3,028 more rows, and 4 more variables: `malaria_rdt_5-14` <int>,
## # malaria_rdt_15 <int>, malaria_tot <int>, newid <int>Next, we want to use {tidyr}’s pivot_longer() function to convert the wide dataset to a long format, converting the four columns with data on malaria counts to two new columns: one which captures the variable name and one which captures the values from the cells. Since these four variables all begin with the prefix malaria_, we can make use of the handy function starts_with().
df_long <-
count_data %>%
pivot_longer(
cols = starts_with("malaria_")
)
df_long
## # A tibble: 12,152 x 8
## location_name data_date submitted_date Province District newid name value
## <chr> <date> <date> <chr> <chr> <int> <chr> <int>
## 1 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 11
## 2 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 12
## 3 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 23
## 4 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 46
## 5 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 11
## 6 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 10
## 7 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 5
## 8 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 26
## 9 Facility 3 2019-06-13 2019-06-14 North Dingo 3 malari… 8
## 10 Facility 3 2019-06-13 2019-06-14 North Dingo 3 malari… 5
## # … with 12,142 more rowsHowever, we could also have specified the columns by position:
count_data %>%
pivot_longer(
cols = 6:9
)
## # A tibble: 12,152 x 8
## location_name data_date submitted_date Province District newid name value
## <chr> <date> <date> <chr> <chr> <int> <chr> <int>
## 1 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 11
## 2 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 12
## 3 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 23
## 4 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 46
## 5 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 11
## 6 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 10
## 7 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 5
## 8 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 26
## 9 Facility 3 2019-06-13 2019-06-14 North Dingo 3 malari… 8
## 10 Facility 3 2019-06-13 2019-06-14 North Dingo 3 malari… 5
## # … with 12,142 more rowsor by named range:
count_data %>%
pivot_longer(
cols = `malaria_rdt_0-4`:malaria_tot
)
## # A tibble: 12,152 x 8
## location_name data_date submitted_date Province District newid name value
## <chr> <date> <date> <chr> <chr> <int> <chr> <int>
## 1 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 11
## 2 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 12
## 3 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 23
## 4 Facility 1 2019-06-13 2019-06-14 North Spring 1 malari… 46
## 5 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 11
## 6 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 10
## 7 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 5
## 8 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malari… 26
## 9 Facility 3 2019-06-13 2019-06-14 North Dingo 3 malari… 8
## 10 Facility 3 2019-06-13 2019-06-14 North Dingo 3 malari… 5
## # … with 12,142 more rowsNotice that the newly created dataframe (df_long) has more rows (12,152 vs 3,038); it has become longer. In fact, it is precisely four times as long, because each row in the original dataset now represents four rows in df_long, one for each of the malaria count observations (<4y, 5-14y, 15y+, and total).
In addition to becoming longer, the new dataset has fewer columns (8 vs 10), as the data previously stored in four columns (those beginning with the prefix malaria_) is now stored in two. These two columns are given the default names of name and value, but we can override these defaults to provide more meaningful names, which can help remember what is stored within, using the names_to and values_to arguments. Let’s use the names age_group and count:
df_long <-
count_data %>%
pivot_longer(
cols = starts_with("malaria_"),
names_to = "age_group",
values_to = "counts"
)
df_long
## # A tibble: 12,152 x 8
## location_name data_date submitted_date Province District newid age_group
## <chr> <date> <date> <chr> <chr> <int> <chr>
## 1 Facility 1 2019-06-13 2019-06-14 North Spring 1 malaria_…
## 2 Facility 1 2019-06-13 2019-06-14 North Spring 1 malaria_…
## 3 Facility 1 2019-06-13 2019-06-14 North Spring 1 malaria_…
## 4 Facility 1 2019-06-13 2019-06-14 North Spring 1 malaria_…
## 5 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malaria_…
## 6 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malaria_…
## 7 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malaria_…
## 8 Facility 2 2019-06-13 2019-06-14 North Bolo 2 malaria_…
## 9 Facility 3 2019-06-13 2019-06-14 North Dingo 3 malaria_…
## 10 Facility 3 2019-06-13 2019-06-14 North Dingo 3 malaria_…
## # … with 12,142 more rows, and 1 more variable: counts <int>We can now pass this new dataset to {ggplot2} to display the malaria counts by age group:
ggplot(df_long) +
geom_col(
aes(x = data_date, y = counts, fill = age_group)
)Have a look at the plot - what is wrong here? We have encountered a common problem - we have also included the total counts from the malaria_tot column, so the magnitude of each bar in the plot is twice as high as it should be.
We can handle this in a number of ways. We could simply filter it from the dataset we pass to {ggplot2}:
df_long %>%
filter(age_group != "malaria_tot") %>%
ggplot() +
geom_col(
aes(x = data_date, y = counts, fill = age_group)
)Alternatively, we could have excluded this variable when we ran pivot_longer, thereby maintaining it in the dataset as a separate variable:
count_data %>%
pivot_longer(
cols = `malaria_rdt_0-4`:malaria_rdt_15,
names_to = "age_group",
values_to = "counts"
) %>%
ggplot() +
geom_col(
aes(x = data_date, y = counts, fill = age_group)
)dplyr pivot_wider()